Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added a BC: workspace document #2197

Merged
merged 17 commits into from
May 10, 2021
Merged

Conversation

iesahin
Copy link
Contributor

@iesahin iesahin commented Feb 15, 2021

This doesn't fix any issue completely.

  • Added workspace.md doc to basic-concepts dir. Expanded this from previous iterations.

  • Added Basic Concepts section to the sidebar and added this document.

  • tooltip definition is not changed.

This PR closes #1127 and is related to #550.

@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue53-hfg3kw February 15, 2021 08:35 Inactive
@@ -6,3 +6,46 @@ tooltip: >-
models, etc. Typically, it's also a Git repository. It will contain your DVC
project.
---

<!-- keywords: data science project architecture, machine learning project architecture, machine learning workflow, data science workflow, machine learning file system, data science file system, data science project structure, machine learning project structure, notebook version control -->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note to remove this comment later before merging (but for now it's useful to have the list during review 👍)

This comment was marked as resolved.

This comment was marked as off-topic.

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please bear with us as we review this first concept as we will probably establish some criteria for all concepts along the way. Some general comments below to begin with:

Comment on lines 12 to 17
# Workspace

A data science project consists of data obtained from many different sources.
Most of the time it needs to convert the the format of this data into a form
that is required by the training models and supplying to data science / machine
learning workflows. Sometimes it needs to be split into multiple files or
Copy link
Contributor

@jorgeorpinel jorgeorpinel Feb 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK to have some context but it should probably be a short intro paragraph only? Maybe just say something about project structures.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think in BC documents, we need to make separation of concerns. If our aim is a quick introduction to DVC concepts, we don't need such story material to begin with. This was solely for SEO purposes.

IMHO I can write blog posts / tutorials and tell stories with all kinds of SEO phrases there instead of trying to use these in docs. As you can see it's possible but I don't think it increases the quality of docs.

I can paraphrase into a single paragraph or remove them completely. I'm fine with all.

This comment was marked as resolved.

Copy link
Contributor

@jorgeorpinel jorgeorpinel Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to make separation of concerns...

We do already have tooltips for shot definitions. But you can't search for "pipeline dvc" on Google and find a tooltip (i.e. we want to have landing pages for our docs).

Another goal of these is to extract explanations out of the cmd refs (e.g. in https://dvc.org/doc/command-reference/dag which shouldn't be where pipelines are explained)

In my mind most concept pages are 2-3 paragraph long in general (we'll see as they come). There could also be some cool diagrams potentially (feel free to contribute sketches, we can fix them up later). They can still have a story, maybe just a very straightforward one? (motivation -> explanation -> uses/implications). WDYT @shcheklein ?

I'm putting link to the last of these highlighted project words.

p.s. I usually "tooltipy" just one instance of the term per page or big section, but I try to make it the first or 2nd one (as long as it doesn't distract e.g. no other links are nearby).

Copy link
Contributor

@jorgeorpinel jorgeorpinel Feb 17, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

p.p.s motivation -> explanation -> uses/implications - I see that's more or less the structure here already 👍

content/docs/user-guide/basic-concepts/workspace.md Outdated Show resolved Hide resolved
content/docs/user-guide/basic-concepts/workspace.md Outdated Show resolved Hide resolved
@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue53-hfg3kw February 16, 2021 17:21 Inactive
@jorgeorpinel

This comment has been minimized.

@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue53-hfg3kw February 18, 2021 08:08 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue53-hfg3kw February 18, 2021 08:49 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue53-hfg3kw February 18, 2021 08:59 Inactive
@iesahin

This comment has been minimized.

@iesahin iesahin mentioned this pull request Feb 21, 2021
14 tasks
contents through DVC commands.

Files and directories in the workspace can be added to DVC (`dvc add`) or they
can be downloaded from external sources (`dvc get`, `dvc import`,

This comment was marked as resolved.

This comment was marked as resolved.

This comment was marked as resolved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But actually add can also download data to the workspace (see --out and --to-remote options). Also, import* commands download AND track data. You may want to rephrase this part accordingly 🙂

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sometimes I think there is no command that hasn't got a duplicate, somehow :) I try to mention commands in passing, if we'd consider each and every option to commands, we'll need to duplicate the command reference here IMHO.

We may just delete the commands if you would like.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can also list all possibilities for each functionality, like

In the workspace, you can

  • Import files and directories (dvc add --out, dvc import-url...)

but I think this will turn the document into a list of commands and options.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to list every command usage of course, agreed!

My point was that these 3 commands mentioned actually overlap in a way that makes the current text slightly incorrect. In any case, the main use case of add is not to "add" but to "track", actually. Please check each cmd ref to try to find the right terms when needed 🙂

"Download" is correct for get/import but add can also download (and they can all "transfer") so I'd avoid that term probably. And in fact I wouldn't even mention get here, since it doesn't require a DVC project/workspace. For import I'd try to use the cmd name as the relevant action (to "import") I guess...

Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the main use case of add is not to "add" but to "track"
For import I'd try to use the cmd name as the relevant action (to "import")

then again import* also track the downloaded data 😅 ("adds"). Maybe it should be a single sentence about tracking and put all add, import, import-url in the same parenthesis.

@shcheklein shcheklein had a problem deploying to dvc-org-iesahin-issue53-hfg3kw February 22, 2021 18:35 Failure
@shcheklein shcheklein had a problem deploying to dvc-org-iesahin-issue53-hfg3kw February 22, 2021 18:43 Failure
@shcheklein shcheklein had a problem deploying to dvc-org-iesahin-issue53-hfg3kw February 22, 2021 19:34 Failure
@shcheklein shcheklein had a problem deploying to dvc-org-iesahin-issue53-hfg3kw February 22, 2021 19:43 Failure
@shcheklein shcheklein temporarily deployed to dvc-org-iesahin-issue53-hfg3kw February 26, 2021 17:06 Inactive
@iterative iterative deleted a comment from iesahin Feb 27, 2021
@jorgeorpinel
Copy link
Contributor

OK I see I neglected this accidentally. Sorry, checking now @iesahin 👍

Comment on lines +13 to +16
These may be split into multiple files or directories or (as the project
structure needs) have different versions for different requirements, e.g., a
smaller / simplified version might be required in prototyping for faster
feedback and shorter training times. A single workspace to manage all artifacts
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

or have different versions for different requirements...

Let's not go into versioning here, I think. At least not by implying they're all in the workspace because in DVC the workspace only holds one version (the rest are cached and managed via Git, metafiles, etc.

(Mentioned in #2197 (comment))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

versioning needs and managing dependencies make it increasingly difficult

p.s. This is better way to very subtly mention versioning (could even link to the corresponding Use Case doc).

of a project is desirable, although versioning needs and managing dependencies
make it increasingly difficult.

DVC allows a single directory to contain all your project artifacts. The
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

a single directory to contain all your project artifacts

Not exactly. File contents are org'd in the cache with a special file structure (see https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)

The workspace is the directory containing the visible part of your project

That part is correct. And contradicts the previous part 🙂 (because "visible part" implies there's a hidden part which must be in other dirs).

Let's open this p with that sentence.

Comment on lines +21 to +22
workspace is the directory containing the _visible_ part of your
<abbr>project</abbr>, e.g., the raw data, source code, model files. You can have
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"e.g., the raw data, source code, model files you're currently using"

Comment on lines +22 to +24
<abbr>project</abbr>, e.g., the raw data, source code, model files. You can have
multiple versions of data, models, and other kinds of artifacts within the
workspace and limit your focus to a subset of these. You can record your
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can have multiple versions of data ... within the workspace

Again contradicting 😕

Comment on lines +24 to +25
workspace and limit your focus to a subset of these. You can record your
progress in a commit and analyze your data and model history. DVC provides a
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

record your progress in a commit and analyze...

Again probably too many details about versioning. Doesn't really fall within the 'workspace' concept, I think. This can be simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to rename your models for minor changes...
or save tens of different renamed files for training

Those are better mentions of versioning (clear benefits i.e. how much you'd have to suffer without DVC)

save cleaned up data in different directories

That specific example isn't great because that's still pretty common even with DVC (e.g. in our own example-get-started repo we have a prepared/ dir).

Comment on lines +25 to +26
progress in a commit and analyze your data and model history. DVC provides a
_machine learning file system_ to manipulate your data and models using its
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This "ML file system" keyword is pretty tricky. No need to force it (just skip it if you can't find a correct way to use it).

I can only think of something like "DVC turns your project into a sort of machine learning file system for..." but not sure.

Comment on lines +29 to +30
programs. DVC can keep track of all of these in a single directory called the
workspace.
Copy link
Contributor

@jorgeorpinel jorgeorpinel Mar 15, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DVC can keep track of all of these in a single directory called the workspace.

Again contradicting and also, repetitive at this point.

Comment on lines +39 to +40
DVC supports all typical operations of a versioned data file system through its
commands. Behind the scene these operations use <abbr>metafiles</abbr> like the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DVC supports all typical operations of a versioned data file system through its commands.

Maybe open the previous paragraph with that?

p.s. having this I think def. no need for the "ml file system" keyword. But keeping "machine learning" somewhere would be nice.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try to incorporate the metafile mentions to the main (2nd) paragraph somehow. After simplifying it per my previous comments, there should be enough room in there. that way there's no need for this 4th p.

Copy link
Contributor

@jorgeorpinel jorgeorpinel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More feedback 👍

@jorgeorpinel
Copy link
Contributor

I guess I should take this over 👍

@jorgeorpinel jorgeorpinel self-assigned this May 6, 2021
@iesahin
Copy link
Contributor Author

iesahin commented May 7, 2021

Context change makes me miserable. I can take care of this after the GS project if you have other tasks. Thank you @jorgeorpinel

@jorgeorpinel jorgeorpinel changed the base branch from master to concept/workspace May 10, 2021 03:01
@jorgeorpinel
Copy link
Contributor

For some reason I can't fetch iesahin/issue53 so I'm going to merge this to a new branch I pushed instead (concept/workspace) to preserve @iesahin's commits, and reopen the PR from there to wrap it up.

@jorgeorpinel jorgeorpinel merged commit b2c9169 into concept/workspace May 10, 2021
@jorgeorpinel jorgeorpinel mentioned this pull request May 10, 2021
1 task
@iesahin iesahin deleted the iesahin/issue53 branch July 14, 2021 12:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

term: ambiguous use of "external" and "workspace"
3 participants